Abstract

Among vertebrates, most of the largest genomes are found within the salamanders, a clade of amphibians that includes 613
species. Salamander genome sizes range from ~14 to ~120 Gb. Because genome size is correlated with nucleus and cell
sizes, as well as other traits, morphological evolution in salamanders has been profoundly affected by genomic gigantism. 
However, the molecular mechanisms driving genomic expansion in this clade remain largely unknown. Here, we present the 
first comparative analysis of transposable element (TE) content in salamanders. Using high-throughput sequencing, we 
generated genomic shotgun data for six species from the Plethodontidae, the largest family of salamanders. We then 
developed a pipeline to mine TE sequences from shotgun data in taxa with limited genomic resources, such as salamanders. 
Our summaries of overall TE abundance and diversity for each species demonstrate that TEs make up a substantial portion of 
salamander genomes, and that all of the major known types of TEs are represented in salamanders. The most abundant TE 
superfamilies found in the genomes of our six focal species are similar, despite substantial variation in genome size. However, 
our results demonstrate a major difference between salamanders and other vertebrates: salamander genomes contain much 
larger amounts of long terminal repeat (LTR) retrotransposons, primarily Ty3/gypsy elements. Thus, the extreme increase in 
genome size that occurred in salamanders was likely accompanied by a shift in TE landscape. These results suggest that 
increased proliferation of LTR retrotransposons was a major molecular mechanism contributing to genomic expansion in 
salamanders. 

Key words: LTR retrotransposon, transposable element landscape, genomic expansion, TE age distributions, genome size 
evolution, plethodontid salamanders. 



Introduction 

Genomes dictate phenotype via their gene and regulatory se- 
quences, which control the production of proteins underlying 
organismal development and function. However, genomes 
also impact phenotype via their overall size, irrespective of 
their DNA sequence. Genome size can have profound effects 
on organismal biology, potentially affecting traits as diverse 
as nucleus size, cell size, duration of the cell cycle, cell differ- 
entiation rate, metabolic rate, embryonic developmental rate, 
limb regeneration rate, life history strategy, invasiveness, and 
extinction rate (Olmo and Morescalchi 1975; Sessions 
and Larson 1987; Jockusch 1997; Gregory 2003; Gregory 
2005b; but see Lynch 2007). Within animals, genome size
varies 6,650-fold (0.02-130 Gb), with 530-fold variation 
within the vertebrates alone (0.34-130 Gb) (Gregory 
2011). Understanding both the molecular mechanisms and 
the evolutionary forces shaping this variation remains a central 
goal in biology (Vinogradov 2004; Oliver et al. 2007). 

Transposable elements (TEs) are mobile DNA sequences 
that can insert into new genomic locations, often replicating 
themselves during the process (Craig et al. 2002). Two classes
of TEs exist that differ in the molecular mechanism by
which they transpose from one genomic location to an- 
other: Class I TEs (retrotransposons) transpose via a "copy 
and paste" mechanism, utilizing an RNA intermediate, 
and generating a new TE copy that inserts into a novel ge- 
nomic location. Most Class II TEs (DNA transposons) trans- 
pose via a "cut and paste" mechanism, utilizing a DNA 
intermediate and moving to a new genomic location with- 
out an obligate increase in copy number (Craig et al. 2002; 
Wicker et al. 2007). These two TE classes are further subdi- 
vided into subclasses, superfamilies, families, etc. based on 
structural features, details of the transposition mechanism, 
and sequence similarity (Wicker et al. 2007). Both TE classes 
coexist in a wide range of eukaryotes, suggesting their an- 
cient evolutionary origins. However, extreme variation in 
the number, activity, and diversity of TEs occurs in the ge- 
nomes of different species, both within and among the 
major eukaryotic clades (Goodier and Kazazian 2008). 

TEs, and other types of repetitive DNA, make up the bulk 
of many eukaryotic genomes and are a major determinant 
of genome size and architecture (Pritham 2009; Venner 
et al. 2009). The effects of individual TE insertions on the 
"host" organism can also vary dramatically. Although some 
TE sequences have been domesticated by their hosts and 
now form critical components of genes and/or gene regu- 
latory networks (Volff 2006; Feschotte 2008), TE insertions 
can be deleterious because they disrupt gene expression or 
protein function following insertion into coding or regulatory
regions of the genome (Montgomery et al. 1987). More
generally, TE insertions can negatively impact the host 
through 1) energetic costs of replication, transcription, 
and translation (Cavalier-Smith 2005); 2) disruptions of cel- 
lular processes by TE proteins (Nuzhdin 1999); 3) suscepti- 
bility to harmful gain-of-function mutations (De Gobbi et al. 
2006); and 4) deletions and rearrangements caused by ec- 
topic recombination between copies of the same TE family 
(Petrov et al. 2003). As a consequence, eukaryotic cells have 
evolved sophisticated machineries to silence TE proliferation 
and protect vital parts of the genome from TE insertion 
(Slotkin and Martienssen 2007; Lisch and Bennetzen 
2011). However, the extreme variation in TE diversity and 
abundance among eukaryotic genomes suggests that the 
balance between TE proliferation and host silencing differs 
dramatically across the tree of life. The evolutionary pro- 
cesses affecting this balance remain poorly understood, de- 
spite the central role of TEs in shaping genome evolution 
(Venner et al. 2009). 

TEs can also impact their host by affecting genome size. 
Proliferation and deletion of TEs cause genomic expansion 
and contraction, respectively (Petrov 2002; Bennetzen et al. 
2005; Gregory 2005a; Vitte and Panaud 2005; Devos 2010),
which can affect genome size's organism-level correlates 
(e.g., cell size and developmental rate) (Roth et al. 1997; 
Gregory 2005b). Such effects can be positive or negative, 
thereby enabling selection to act indirectly on TE content. 
The efficiency of such selection is determined by population 
genetic parameters such as effective population size (Lynch 
2007; Lynch et al. 2011). Thus, genome size and content 
likely reflect a dynamic interaction between molecular pro- 
cesses (TE dynamics and host silencing) and selection acting 
on organismal traits (Bennetzen and Kellogg 1997; Agren 
and Wright 2011). Clades with extreme genome sizes provide
critical test cases in which to explore this interaction;
they represent instances where an unusual balance has been 
struck among these evolutionary forces. 

Among vertebrates, most of the largest genomes are 
found within salamanders, a clade of amphibians that 
includes 613 recognized species (AmphibiaWeb 2011) 
(fig. 1). Salamander genome sizes range from ~14 to
~120 Gb; these values are larger than all bird, mammal,
reptile, and frog genomes, as well as most "fish" genomes 
(Gregory 2011), although extensive synteny conservation 
does exist between salamanders and other tetrapods (Voss 
et al. 2011). Karyotype and DNA reassociation kinetic studies
have shown that salamanders' large genomes reflect high 
levels of repetitive DNA rather than polyploidy; however, such 
repeat elements remain almost completely uncharacterized, 
and TE silencing in salamanders remains unexplored (Green 
1991; Sessions and Kezer 1991; Batistoni et al. 1995; 
Marracci et al. 1996). In contrast, organismal correlates of 
large genome size have been well characterized in salaman- 
ders, particularly in the Plethodontidae, the largest family 
(417 species, genome size range ~14 to ~74 Gb), where
morphological evolution has been profoundly shaped by ge- 
nomic gigantism (Hanken and Wake 1993). For example, 
constraints on the number of large cells that can fit into 
the braincase, as well as slow cell division and differentiation 
rates, have caused substantial simplification of the nervous 
and visual systems (e.g., low numbers of retinal and optic tec- 
tum neurons) (Sessions and Larson 1987; Roth et al. 1994; 
Roth et al. 1997). Such simplification reduces visual acuity 
(Hanken and Wake 1993; Roth et al. 1994); however, plethodontids
have evolved compensatory visual adaptations (e.g.,
increased allocation of their brains to the optic tectum) 
(Wiggers and Roth 1991). Other compensatory adaptations 
are found in the circulatory system, where some miniaturized 
plethodontids have evolved enucleated red blood cells, likely 
to overcome physical constraints associated with circulating 
huge cells (Mueller et al. 2008). These examples suggest that 
plethodontids have evolved features that offset deleterious 
effects imposed by their expanding genomes (Wiggers and 
Roth 1991; Roth et al. 1997), indicating that an unusual balance
between TE proliferation, host silencing, and selection
on organism-level traits underlies the huge genome sizes in 
salamanders. 

Although studies integrating organismal biology and TE 
dynamics have recently been initiated in the avian clade, 
Fig. 1. Summary of nuclear genome sizes for 13 vertebrate clades. Data are compiled from the Animal Genome Size Database (Gregory 2011).
Sample sizes (number of species summarized) are in parentheses following clade names. 



which has experienced genome size reduction (e.g., Organ 
et al. 2007), relatively little attention has been paid to ver- 
tebrate genome size evolution at the large end of the size 
spectrum (but for notable exceptions, see Smith et al. 2009; 
Voss et al. 2011). The repetitive landscapes of salamanders'
huge genomes remain largely uncharacterized, and hypoth- 
eses integrating TE dynamics and organism-level selection 
remain untested. Here, we begin to fill this gap by using 
low-coverage high-throughput shotgun sequencing to gen- 
erate genomic data for six species of salamanders and 
leveraging these data to perform the first comprehensive 
analysis of TE landscapes in the salamander clade. We de- 
veloped a pipeline to mine TE sequences from low-coverage 
shotgun reads and estimate TE abundance and diversity, al- 
lowing us to make comparisons 1) between salamanders 
and other vertebrates with more "typical" (i.e., smaller) ge- 
nome sizes, as well as 2) among the different salamander 
species. Our results show that salamander genomes contain 
all of the main TE superfamilies identified in well-annotated 
eukaryotic genomes. Across our six focal species, the most 
abundant TE superfamilies are very similar, and Ty3/gypsy 
elements (Class I retrotransposons) are by far the most abun- 
dant in all species examined. However, our results demon- 
strate a substantial difference between salamanders and 
other vertebrates: salamander genomes accumulate much 
larger amounts of long terminal repeat (LTR) retrotranspo- 
sons. More generally, our results emphasize the importance 
of studying "outlier" taxa to generate a more comprehensive 
picture of vertebrate genome evolution. 



Materials and Methods 

Taxon Selection 

We chose to generate low-coverage data from multiple 
taxa, rather than deep coverage data from a single taxon, 
in order to identify shared genomic features characteristic 
of the salamander clade. We focused our analyses on the 
family Plethodontidae, which contains more than two-thirds 
of extant salamander species. Plethodontids have been the 
focus of much genome size evolution research (Sessions and 
Larson 1987; Roth et al. 1994, 1997; Jockusch 1997), pro- 
viding context for our genomic analyses. Six species of ple- 
thodontids were chosen that span the deepest phylogenetic 
split within the family: subfamily Plethodontinae (Aneides 
flavipunctatus and Desmognathus ochrophaeus) and Hemi- 
dactyliinae (Batrachoseps nigriventris, Bolitoglossa occiden- 
talis, Bolitoglossa rostrata, and Eurycea tynerensis) (Vieites 
et al. 2011). These taxa encompass a range of the smaller
genome sizes found in the clade (~15 to ~47 Gb; the largest
plethodontid genome is ~70 Gb) (Gregory 2011). The
phylogenetic relationships among the six species are ((((B.
occidentalis, B. rostrata), B. nigriventris), E. tynerensis), 
(A. flavipunctatus, D. ochrophaeus)). Divergence dates in 
salamanders remain the topic of much debate. The basal 
split within plethodontids has been dated at ~40 to
~130 Myr, depending on data set and analytical technique;
divergence time estimates for our six focal taxa are similarly
varied, but all are >25 Myr (Mueller 2006; Marjanovic and
Laurin 2007; Kozak et al. 2009; Zhang and Wake 2009;
Zheng et al. 2011). Genome size estimates and voucher
specimen information are summarized in table 1.

Table 1

Specimen Information and Shotgun Sequencing Results

Species                      Voucher Information    Genome Size (Gb)
Aneides flavipunctatus       RLM172                 44
Batrachoseps nigriventris    ELJ1556                25
Eurycea tynerensis           RMB3457                25 a
Desmognathus ochrophaeus     UAHC 16065             15
Bolitoglossa rostrata        SMR 360                47
Bolitoglossa occidentalis    GP1395                 43

a Represents an average of nine other Eurycea species.

Shotgun Library Creation and Sequencing 

Total DNA was extracted from liquid-nitrogen snap-frozen 
liver or tail tissue by standard phenol-chloroform-isoamyl 
alcohol extraction methods or the Gentra Puregene tissue 
kit (Qiagen). 454 FLX-LR and 454 Titanium-XLR genomic 
shotgun libraries were prepared using the 454 shotgun li- 
brary preparation kits and protocols (Roche) for FLX and Ti- 
tanium sequencing, respectively. Libraries for Bolitoglossa 
occidentalis and B. rostrata were sequenced on the Roche 
454-FLX sequencing platform using FLX-LR sequencing kits, 
whereas all other species were sequenced on the Roche 
454-FLX platform with FLX-XLR Titanium reagents. Based 
on previous studies of complex plant genomes (e.g., barley 
and pea), we scaled our data collection efforts to produce 
~1% genomic coverage (i.e., 0.01× of the genome at 1×
depth), as this sequencing depth has been shown to yield rea- 
sonable summaries of TE abundance for elements present at 
> 1,000 copies/genome (Macas et al. 2007; Wicker et al. 
2009). Library preparation and sequencing were performed 
by the Consortium for Comparative Genomics at the University 
of Colorado School of Medicine (B. rostrata, B. occidentalis, 
and Desmognathus ochrophaeus) and the University of Idaho 
Institute for Bioinformatics and Evolutionary Studies Genomics 
Resources Core facility (Aneides flavipunctatus, Batrachoseps 
nigriventris, and Eurycea tynerensis). 
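
As a rough check on the ~1% coverage target, genomic coverage can be estimated as the total number of sequenced base pairs divided by the haploid genome size. The minimal sketch below applies this to the genome sizes and base-pair totals listed in table 1. The function name percent_coverage is ours, and the results can differ slightly from the percentages printed in table 1, presumably because those were computed from unrounded genome sizes or from filtered read totals (an assumption on our part).

```python
# Illustrative sketch only: coverage = sequenced bp / haploid genome size.
# Genome sizes (Gb) and base-pair totals are the values listed in table 1.

def percent_coverage(total_bp: int, genome_size_gb: float) -> float:
    """Return sequenced base pairs as a percentage of the haploid genome."""
    genome_bp = genome_size_gb * 1e9
    return 100.0 * total_bp / genome_bp


# species: (total base pairs sequenced, genome size in Gb)
TABLE_1 = {
    "Aneides flavipunctatus": (308_615_225, 44),
    "Batrachoseps nigriventris": (487_538_903, 25),
    "Eurycea tynerensis": (389_972_620, 25),
    "Desmognathus ochrophaeus": (227_156_262, 15),
    "Bolitoglossa rostrata": (40_553_103, 47),
    "Bolitoglossa occidentalis": (28_841_057, 43),
}

if __name__ == "__main__":
    for species, (total_bp, genome_gb) in TABLE_1.items():
        print(f"{species}: {percent_coverage(total_bp, genome_gb):.2f}%")
```
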

Initial Data Processing 

Mitochondrial reads were screened out from all data sets us- 
ing Blast with reference mitochondrial genome sequences 
from the same or closely related taxa (Mueller et al. 2004, 
2008). Next, shotgun reads from each data set were checked 
for sequencing artifacts generated by the presence of multiple 
beads and a single template in emPCR drops, which can po- 
tentially produce multiple identical sequences that can skew 
estimates of repeat element abundance (Gomez-Alvarez et al. 
2009; Niu et al. 2010). For data sets with <350 Mb of shotgun
reads, the online 454 Replicate Filter (http://microbiomes.msu.edu/replicates/
[date last accessed 17 Nov 2011])
was used to filter out exact replicates (cutoff = 1.0, length
requirement = 1.00, and initial base pair match = 3). For data
sets with >350 Mb of shotgun reads, the locally installed
cdhit-454 (http://weizhong-lab.ucsd.edu/cdhit_454/cgi-bin/index.cgi?cmd=Introduction
[date last accessed 26 Sep 2011])
was used to filter out exact replicates (-c 1.00 -aS
0.9 -aL 0.6, other parameters set to default values). In total,
0.70-4.89% of shotgun reads were identified as potential
sequencing artifacts in each data set, and all such reads
were removed from further analysis. Finally, repeat
elements with significant sequence similarity to elements
identified from well-annotated genomes were identified
using RepeatMasker, with RepBase (version 16.04) (http://www.girinst.org/
[date last accessed 26 Sep 2011]) as a reference
library.

Table 1 (continued)

Species                      Number of Reads    Total Number of Base Pairs    Percentage of Coverage
Aneides flavipunctatus       1,044,399          308,615,225                   0.70
Batrachoseps nigriventris    1,131,828          487,538,903                   1.91
Eurycea tynerensis           1,089,945          389,972,620                   1.59
Desmognathus ochrophaeus     845,984            227,156,262                   1.49
Bolitoglossa rostrata        183,143            40,553,103                    0.09
Bolitoglossa occidentalis    124,242            28,841,057                    0.07
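
The idea behind removing emPCR replicates can be illustrated with a much simpler filter than the tools named above. The sketch below keeps the first read of every identical sequence and counts the rest as replicates. It is not a reimplementation of the 454 Replicate Filter or cdhit-454, which cluster reads using prefix matching and length requirements; the function name and the (read_id, sequence) input format are hypothetical.

```python
# Simplified illustration: drop reads whose full sequence exactly duplicates
# an earlier read. The real tools cluster near-identical reads as well.
from typing import Iterable, List, Tuple

Read = Tuple[str, str]  # (read_id, sequence); hypothetical input format


def filter_exact_replicates(reads: Iterable[Read]) -> Tuple[List[Read], int]:
    """Return (reads kept, number of exact replicates removed)."""
    seen = set()
    kept: List[Read] = []
    removed = 0
    for read_id, seq in reads:
        key = seq.upper()  # treat lower/upper case bases as identical
        if key in seen:
            removed += 1
        else:
            seen.add(key)
            kept.append((read_id, seq))
    return kept, removed


if __name__ == "__main__":
    toy = [("r1", "ACGTAC"), ("r2", "acgtac"), ("r3", "ACGTAA")]
    kept, removed = filter_exact_replicates(toy)
    print(f"kept {len(kept)} reads, removed {removed} replicate(s)")
```
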

We developed a pipeline to mine TE sequences from low- 
coverage shotgun sequence data representing unexplored ge- 
nomes. The pipeline includes five main steps, outlined below, 
and is summarized in supplementary file 1, Supplementary 
Material online. Most of the pipeline was automated by 
custom Perl scripts, which are available upon request. 

TE Mining Step 1: Identify and Classify Repeat 
Sequences from Shotgun Reads 

We used RepeatModeler (http://www.repeatmasker.org/ 
RepeatModeler.html [date last accessed 26 Sep 2011]) to 
identify de novo repetitive sequences for each species. To 
identify repeats, RepeatModeler combines de novo repeat 
detection programs RepeatScout (Price et al. 2005) and RECON
(Bao and Eddy 2002), which use k-mer and self-comparison
approaches, respectively. To classify de novo repeats,
RepeatModeler generates consensus sequences from align- 
ments of similar reads and attempts to classify them using 
RepBase. Such consensus sequences from RepeatModeler 
were further classified using REPCLASS, software that auto- 
mates the classification of TEs based on homology, struc- 
ture, and target-site duplication (Feschotte et al. 2009). 
Following REPCLASS analyses, all de novo repeats initially 
identified by RepeatModeler were classified as "TE-derived 
repeats" or "unknown repeats." 

TE Mining Step 2: Assemble Shotgun Reads into Contigs 

We assembled shotgun reads from each of our focal ge- 
nomes into contigs using Phrap (http://phrap.org/ [date last 
accessed 26 Sep 2011]) (minmatch = 20, other parameters
set to default values) and PCAP (Huang et al. 2003) (all
parameters set to default values). Although our data provide
only <1.9% coverage, TEs present in high copy number, 
with low sequence divergence, should be represented by 
composite contigs that span much of their length, including 
both coding and noncoding sequences (Macas et al. 2007; 
Swaminathan et al. 2007). 

TE Mining Step 3: Identify TE-Containing Contigs 

Following assembly, we used Blast to query the repeats iden- 
tified in Step 1 against the contigs generated in Step 2 to iden- 
tify contigs that include transposition-associated protein- 
coding sequences. Specifically, we started by using each TE-de- 
rived repeat from Step 1 (with the exception of SINEs, which 
encode no transposition-associated proteins) as a query to 
BlastN against the assembled contigs with an e-value threshold 
cutoff of e-10. The top 20 hits for each such repeat were
parsed to a file, and the sequence of each hit was used to 
BlastX against the amino acid sequences of TE-encoded 
proteins (http://www.repeatmasker.org/RepeatProteinMask.html#database
[date last accessed 26 Sep 2011]) to verify that
the contig contained the expected target transposition- 
associated protein-encoding sequences. Then, the three lon- 
gest contigs that met these criteria were chosen to represent 
the query repeat, and these contigs were assigned to the same 
TE superfamily as the query repeat. 

We also analyzed repeats identified by RepeatModeler, 
but classified as "unknown" in Step 1, in order to determine
whether we could classify them successfully using our as- 
sembled contigs. We began by using all of the TE sequence 
contigs identified above to mask, using RepeatMasker, the 
set of unknown repeats identified in Step 1; reads that re- 
mained unmasked were extracted. Then, each unmasked 
repeat was queried using BlastN against the contigs generated
in Step 2 with an e-value threshold cutoff of e-10. The
top 3 hits were collected to represent the unknown repetitive
sequence. Finally, these collected sequences were queried
using BlastX (e-value threshold cutoff of e-4) against
the amino acid sequences of TE-encoded proteins to identify 
contigs that contained sequences encoding transposition- 
associated proteins, and each identified contig was assigned 
to the same TE superfamily as its first hit. 
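
The hit-selection logic of Step 3 can be sketched as follows, assuming the BlastN results have already been parsed into records. The Hit record, the function names, and the reading of the "e-10" threshold as 1e-10 are our assumptions; the BlastX verification against TE-encoded proteins is not modeled, so longest_contigs should be thought of as running on the contigs that passed that check.

```python
# Sketch of Step 3 hit selection: filter BlastN hits by e-value, keep the
# top 20 per query repeat, then pick the three longest candidate contigs.
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str       # TE-derived repeat from Step 1
    contig: str      # assembled contig from Step 2
    evalue: float
    bitscore: float


def top_hits(hits: Iterable[Hit], max_evalue: float = 1e-10,
             n: int = 20) -> Dict[str, List[Hit]]:
    """Group hits by query and keep the n best hits passing the threshold."""
    by_query: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        if hit.evalue <= max_evalue:
            by_query[hit.query].append(hit)
    return {
        query: sorted(group, key=lambda h: (h.evalue, -h.bitscore))[:n]
        for query, group in by_query.items()
    }


def longest_contigs(contigs: Iterable[str], contig_lengths: Dict[str, int],
                    n: int = 3) -> List[str]:
    """Return the n longest distinct contigs (ties broken by name)."""
    unique = sorted(set(contigs), key=lambda c: (-contig_lengths[c], c))
    return unique[:n]
```

For the "unknown" repeats described above, the same helper could be called with n=3 for the BlastN step, and each verified contig would then inherit the superfamily of its best BlastX hit.
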

TE Mining Step 4: Verify and Refine TE-Containing 
Contigs 

All the contigs we identified that contained transposition- 
associated protein-coding sequences were combined and 
sorted by length. We then examined each sequence to 
determine if it represented a complete full-length TE based 
on the following criteria: 1) Does the sequence contain in- 
tact coding regions for all relevant transposition-related 
proteins? This was determined using ORF Finder (http:// 
www.ncbi.nlm.nih.gov/gorf/gorf.html [date last accessed 
26 Sep 2011]), coupled with BlastX against the amino acid 
sequences of TE-encoded proteins. For elements (e.g., non- 
LTR retrotransposons and Helitrons) that lack diagnostic struc- 
tural features associated with their boundaries (e.g., LTRs or ter- 
minal inverted repeats [TIRs]), this was our sole criterion. 2) Does 
the sequence contain the hallmarks of TE sequence boundaries 
(e.g., LTRs or TIRs), indicating that the contig represents a full-length
TE? This was determined using NCBI-Blast2 (http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_PROG_DEF=megaBlast&BLAST_SPEC=blast2seq
[date last accessed 26 Sep 2011]). Additionally,
contigs were checked to ensure that they lacked endogenous
(non-TE) gene fragments and nested TE insertions using TBlastX
against the amino acid sequences of frog annotated proteins
(ftp://ftp.ncbi.nih.gov/genomes/Xenopus_Silurana_tropicalis/protein/
[date last accessed 26 Sep 2011]) and RepBase, respectively.
Contigs were also checked to ensure that they were not
dimers or other assembly artifacts formed by joining intact element
sequences with additional partial, or complete, elements
through misassembled LTR or TIR sequences. Finally, as a reference,
we searched for full-length TEs from the 16 bacterial artificial
chromosome (BAC) clones of the salamander Ambystoma
mexicanum available in GenBank (Smith et al. 2009). Ambystoma
mexicanum is a representative of the salamander family Ambystomatidae,
which last shared a common ancestor with
plethodontid salamanders ~85-200 Ma (Marjanovic and Laurin
2007; Zhang and Wake 2009; Zheng et al. 2011). Candidate full-length
TEs were identified using the amino acid sequences of
TE-encoded proteins (http://www.repeatmasker.org/RepeatProteinMask.html#database
[date last accessed 26 Sep 2011]) as
queries to TBlastN against the BAC clone sequences. All regions
that produced significant hits (e-values < e-10) were excised with
5 kb of flanking regions. TIRs or LTRs were identified by NCBI-Blast2.
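
The excision of BAC regions with 5 kb of flanking sequence amounts to clamped coordinate arithmetic. A minimal sketch follows, assuming 0-based, end-exclusive hit coordinates; the function name and this coordinate convention are ours and are not taken from the authors' Perl scripts.

```python
# Sketch: excise a hit region plus up to `flank` bases on each side,
# clamping at the ends of the BAC sequence.

def excise_with_flanks(seq: str, start: int, end: int, flank: int = 5000) -> str:
    """Return seq[start:end] extended by up to `flank` bases on both sides.

    Coordinates are 0-based and end-exclusive (an assumed convention).
    """
    if not 0 <= start < end <= len(seq):
        raise ValueError("hit coordinates must lie within the sequence")
    left = max(0, start - flank)
    right = min(len(seq), end + flank)
    return seq[left:right]
```

Keeping the flanks matters because the terminal repeats (LTRs or TIRs) that mark element boundaries lie outside the protein-coding region matched by TBlastN.
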

TE Mining Step 5: Summarize the Overall TE Landscape 
of Each Species 

All of the refined contigs that encode transposition-related 
proteins (Step 4), all of the repeats derived from TEs that 
were not represented by any contigs (Steps 1 and 4), all 
of the unknown repeats (Step 1), and all of the repeats classified
as SINEs (Step 1) were combined to produce a species-specific
repeat library for each of our focal taxa. Because
none of our focal species is particularly closely related to 
any other (>25 Myr since common ancestry), masking species
with the repeat libraries of other species did not improve
our results. Using these libraries, we masked each genome 
with RepeatMasker to yield a comprehensive summary of 
the TE landscape of each species. The annotation file pro- 
duced by RepeatMasker was used to determine the TE di- 
versity and abundance within each species. All elements 
comprising >0.01% of our shotgun reads were ranked 
by abundance in each genome. Next, for each species, 
we calculated the total proportion of shotgun data anno- 
tated to the three main TE orders: 1) LTR retrotransposons, 
2) non-LTR retrotransposons (including SINEs), and 3) DNA
transposons. 
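
The summary in Step 5 can be sketched as a simple aggregation, assuming the RepeatMasker annotations have already been reduced to (superfamily, order, masked base pairs) records. That record format and the function name are hypothetical; parsing the RepeatMasker annotation file itself is not shown.

```python
# Sketch of Step 5: percentage of shotgun data annotated per superfamily
# (ranked, keeping elements above 0.01%) and per TE order.
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Annotation = Tuple[str, str, int]  # (superfamily, order, masked_bp); hypothetical


def summarize_te_landscape(
    annotations: Iterable[Annotation],
    total_bp: int,
    min_percent: float = 0.01,
) -> Tuple[List[Tuple[str, float]], Dict[str, float]]:
    """Return (ranked superfamily percentages, percentage per TE order)."""
    per_superfamily: Dict[str, int] = defaultdict(int)
    per_order: Dict[str, int] = defaultdict(int)
    for superfamily, order, masked_bp in annotations:
        per_superfamily[superfamily] += masked_bp
        per_order[order] += masked_bp

    ranked = sorted(
        ((name, 100.0 * bp / total_bp) for name, bp in per_superfamily.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # Keep only elements comprising more than min_percent of the data.
    ranked = [(name, pct) for name, pct in ranked if pct > min_percent]
    order_percent = {order: 100.0 * bp / total_bp for order, bp in per_order.items()}
    return ranked, order_percent
```
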

Comparison of the Salamander TE Landscape with 
Other Vertebrate Genomes 

To test whether salamanders' large genomes reflect a funda- 
mentally different TE landscape than is found in the genomes 
of other vertebrates with more typical genome sizes, we ob- 
tained summaries of TE content from five complete vertebrate 
genomes (Homo sapiens, Gallus gallus, Danio rerio, Anolis 
carolinensis, and Xenopus tropicalis) and compared the pro- 
portions of each genome composed of 1) LTR retrotranspo- 
sons, 2) non-LTR retrotransposons, and 3) DNA transposons. 
TE summaries for Homo sapiens, Gallus gallus, Anolis caroli- 
nensis, and Xenopus tropicalis were obtained from their 
genome publications (International Human Genome Sequenc- 
ing Consortium 2001; Hillier et al. 2004; Hellsten et al. 2010; 
Alföldi et al. 2011). The summary for Danio rerio was obtained
using the out file of RepeatMasker from the University of 
California Santa Cruz genome browser (http://hgdownload.cse.ucsc.edu/goldenPath/danRer7/bigZips/
[date last accessed 26 Sep 2011]) and the genome assembly from the Danio rerio
Sequencing Project (http://www.sanger.ac.uk/Projects/D_rerio/;
Wellcome Trust Sanger Institute). 

Comparison of TE Landscapes among Salamanders 

Although our primary goal was to compare salamander ge- 
nomes with those of other vertebrates, we also compared 
TE content among our six focal taxa. To this end, we per- 
formed principal component analysis (PCA) on the relative 
abundances of different elements present in the genomes of 
Desmognathus ochrophaeus, Eurycea tynerensis, Aneides 
flavipunctatus, and Batrachoseps nigriventris, as these data 
sets represent fairly equivalent coverage (0.7-1.9%); the final
two species (Bolitoglossa rostrata and B. occidentalis)
were excluded from this analysis because their coverage 
is much lower (0.07-0.09%), limiting our power to estimate 
TE abundance. 

TE Age Distributions and Element Proliferation History in 
Salamanders 

We analyzed the proliferation history of the most abundant 
superfamily from each TE class: the Gypsy superfamily (LTR 
retrotransposon), the L2/CR1 superfamily (non-LTR retro- 
transposon), and the Harbinger superfamily (DNA transpo- 
son). All shotgun reads masked by each superfamily were 
collected from the four species for which we had 0.7-1.9%
genome coverage (Desmognathus ochrophaeus, Eurycea
tynerensis, Aneides flavipunctatus, and Batrachoseps
nigriventris). RepeatScout was used to construct consensus 
sequences representing fragments of ancestral elements 
from all shotgun reads masked by each family; multiple di- 
vergent consensus sequences mapping to the same TE 
region represent different subfamilies (Macas et al. 2007). 
Such consensus sequences were used as a repeat library 
to mask the relevant reads with RepeatMasker, generating 
percent divergence estimates for each read from its ances- 
tral sequence. Corrected percent sequence divergences 
were then estimated using the Jukes-Cantor model of nu- 
cleotide substitution. Results were summarized as frequency 
histograms and represent summaries of superfamily-wide 
proliferation history. 
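
For reference, the Jukes-Cantor correction converts an observed proportion of differing sites p into a corrected distance d = -(3/4) ln(1 - (4/3) p). The sketch below applies it to per-read percent divergences and bins the results into a frequency histogram. The function names and the 1% bin width are our choices, not details given in the text.

```python
# Sketch: Jukes-Cantor correction of percent divergence, then a histogram.
import math
from typing import Dict, Iterable


def jukes_cantor(p: float) -> float:
    """Corrected distance for an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 for the JC model")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def divergence_histogram(percent_divergences: Iterable[float],
                         bin_width: float = 1.0) -> Dict[int, int]:
    """Count reads per bin of corrected percent divergence.

    Bin k covers corrected divergences in [k * bin_width, (k + 1) * bin_width).
    """
    counts: Dict[int, int] = {}
    for pct in percent_divergences:
        corrected = 100.0 * jukes_cantor(pct / 100.0)
        bin_index = int(corrected // bin_width)
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return dict(sorted(counts.items()))
```

Because the correction grows faster than the raw divergence, older copies shift further to the right of the histogram than recent ones.
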

TE Proliferation Dynamics, TE Content, and Genome 
Size Comparisons across Salamander Species 

Our six focal taxa differ in genome size (table 1), encompassing
a range of the large sizes found across the salamander
clade (fig. 1). To test whether such differences reflect any 
aspect(s) of TE proliferation dynamics, we tested whether 
larger genomes showed evidence of either 1) more recent 
or 2) more frequent bursts of proliferation than smaller ge- 
nomes by comparing the shapes of the element age distri- 
butions across taxa. We also tested whether genome size 
differences primarily reflect variation in the abundance of 
specific TEs by testing whether PC scores for each PC axis 
were related to genome size. 

Results 

Shotgun Library Summary Statistics and Initial Data 
Processing 

The sequence data obtained for our six focal salamander 
species are summarized in table 1. Sequences have been 
deposited in the GenBank short read archive (accession 
numbers SRA046114.1, SRA046116.1, SRA046118.1, 
SRA046119.1, SRA046120.1, and SRA046121.1) and the 
DRYAD repository (doi:10.5061/dryad.308g1h54). The 
number of reads obtained per species ranges from 
124,242 to 1,131,828, and the total amount of sequence 
generated per species ranges from 28 to 487 Mb. Sequenc- 
ing coverage per species ranges from 0.07% to 1.91% of 
the genome; 0.01-0.06% of this was screened out as mitochondrial
sequence and 0.70-4.89% of this was filtered
out as identical reads, likely sequencing artifacts generated 
during emPCR. 

Efficiency of Our TE-Mining Method for Low-Coverage 
Shotgun Read Data 

More than 260 Myr have elapsed since salamanders last 
shared a common ancestor with Xenopus, the most closely 
related organism with annotated TEs in RepBase (Marjanovic 
and Laurin 2007; Roelants et al. 2007). Thus, we anticipated 
low success identifying TEs based on sequence similarity to 
TEs known from other organisms. Consistent with this, our 
RepeatMasker analyses, using RepBase (16.04) as the repeat 
library, were largely unsuccessful; only —0.2-1.9% of our 
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Table 2 

Percentage of 454 Shotgun Data Classified Using Different Methods 





Repeat 


Repeat 








Masker/RepBase 


Modeler/REPCLASS 


Our TE-Mining 


Our TE-Mining Method 


Species 


(%) 


(%) 


Method (%) 


(% unclassified repeats) 3 


Aneides flavipunctatus 


1.91 


23.90 


47.52 


15.01 


Batrachoseps nigriventris 


1.29 


16.92 


39.39 


7.57 


Eurycea tynerensis 


1.15 


9.81 


25.18 


8.09 


Desmognathus ochrophaeus 


0.16 


9.41 


39.69 


11.98 


Bolitoglossa rostrata 


1.15 


3.50 


30.18 


17.79 


Bolitoglossa occidentalis 


1.64 


4.35 


33.19 


8.38 



a For comparison, we also show the percentage of data identified by our method as nonsimple repeats, but not classified as known TE sequence. 



data were recognized as TEs (table 2). Furthermore, because 
454 shotgun data consist of only short (<400 bp) reads, TE 
identification based on structural features and target site se- 
quence information is not feasible. Thus, we relied on de 
novo repeat detection methods (RepeatModeler) to iden- 
tify/classify candidate TE sequences in our data set and 
further classified them using REPCLASS. De novo salamander 
repeats classified as TEs were then used as repeat libraries to 
mask the shotgun reads of each species with RepeatMasker. 
Although these results were a significant improvement over 
our initial RepeatMasker runs (3.5-23.9% of each genome 
was classified as TEs, table 2), the majority of our shotgun 
reads remained unclassified. Examination of our repeat 
classification results showed that almost all classified repeats 
were derived from the conserved protein-coding portions of 
TEs. However, full-length TEs may also include large amounts 
of less conserved coding and noncoding sequences. Thus, our 
results suggested that the classification performed by Repeat- 
Modeler/REPCLASS was unable to identify shotgun reads 
mapping to less conserved TE regions, likely leading to severe 
underestimation of TE content in these largely unexplored 
genomes. 
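
The percentages reported for each masking run can be thought of as masked bases divided by total bases. A minimal sketch of that bookkeeping follows, assuming that repeat annotations have already been parsed into (read, start, end) tuples with half-open coordinates and that the percentage is computed over bases rather than reads. The function names and the toy reads are hypothetical; this is not RepeatMasker's own code or output format.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of double-counting bases.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def percent_classified(read_lengths, hits):
    """Percentage of all read bases covered by at least one TE annotation.

    read_lengths: {read_id: length in bp}
    hits: iterable of (read_id, start, end) annotations
    """
    total = sum(read_lengths.values())
    if total == 0:
        raise ValueError("no sequence data")
    per_read = {}
    for read_id, start, end in hits:
        per_read.setdefault(read_id, []).append((start, end))
    masked = sum(
        end - start
        for intervals in per_read.values()
        for start, end in merge_intervals(intervals)
    )
    return 100.0 * masked / total


# Toy data (hypothetical): three reads, two of which carry TE annotations.
toy_lengths = {"r1": 300, "r2": 200, "r3": 500}
toy_hits = [("r1", 0, 100), ("r1", 50, 150), ("r2", 0, 100)]
print(percent_classified(toy_lengths, toy_hits))
```

Merging intervals before summing matters because overlapping annotations on the same read would otherwise inflate the masked fraction.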

To address this issue, we assembled all 454 shotgun reads 
for each species into contigs and identified those harboring 
sequences encoding transposition-related proteins. Such 
contigs, in turn, allowed us to classify sequences derived 
from less conserved coding and noncoding regions of TEs 
through their location on the same contig as classifiable 
TE-coding sequences. When we used such contigs as repeat 
libraries to mask our shotgun reads, we were able to classify 
25.18-47.52% of each data set as known TE sequences, 
representing a 20- to 200-fold increase over RepeatMasker 
analyses using RepBase as a library and a 2- to 9-fold in- 
crease over RepeatModeler/REPCLASS-based classification 
methods (table 2). Thus, our TE-mining pipeline is an im- 
provement in analytical tools available to characterize the 
repeat element landscape of large unexplored genomes us- 
ing low-coverage shotgun sequences. 
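
The logic of classifying noncoding TE sequence through its contig can be sketched as a two-step label propagation. This is an illustration only: in the analysis described above the labeled contigs served as a RepeatMasker library, whereas the sketch assumes that protein hits per contig and a read-to-contig assignment are already available. All names, scores, and identifiers below are hypothetical.

```python
def classify_contigs(protein_hits):
    """Assign each contig the superfamily of its best-scoring protein hit.

    protein_hits: iterable of (contig_id, superfamily, score)
    """
    best = {}
    for contig_id, superfamily, score in protein_hits:
        if contig_id not in best or score > best[contig_id][1]:
            best[contig_id] = (superfamily, score)
    return {contig_id: label for contig_id, (label, _) in best.items()}


def classify_reads(read_to_contig, contig_labels):
    """Reads inherit the label of their contig, including reads that fall
    in noncoding parts of the element; all others stay unclassified."""
    return {
        read_id: contig_labels.get(contig_id, "unclassified")
        for read_id, contig_id in read_to_contig.items()
    }


# Toy input (hypothetical): contig c1 has two competing protein hits.
toy_hits = [
    ("c1", "LTR/Gypsy", 250.0),
    ("c1", "LTR/ERV1", 90.0),
    ("c2", "LINE/L2", 120.0),
]
toy_assignment = {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c3", "r5": None}

labels = classify_contigs(toy_hits)
print(classify_reads(toy_assignment, labels))
```

Here reads r1 and r2 are both labeled through contig c1 even if only one of them overlaps the protein-coding region, which is the gain over classifying reads one at a time.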

In addition, the assembly step of our TE-mining pipeline
allowed us to generate seven putatively full-length
elements, composite sequences representative of salamander
TE superfamilies. After verification and refinement, we
confirmed contigs representing full-length sequences of
several superfamilies of Class I TEs: Ty3/gypsy,
ERV1, DIRS, and Ngaro elements (LTR retrotransposons), 
as well as L1 and L2/CR1 elements (non-LTR retrotranspo- 
sons). In addition, we confirmed contigs representing 
a full-length rolling circle Helitron (Class II TE). The structures 
of the seven full-length TEs we assembled are summarized in 
figure 2, and each is largely consistent with the structure 
reported for the same superfamily from other eukaryotic ge- 
nomes. Sequences of these complete elements, as well as 
the full-length elements identified from Ambystoma mexi- 
canum BAC clones, are available as supplementary file 2, 
Supplementary Material online. To our knowledge, this is 
the first description of the structure of full-length TEs in sal- 
amander genomes. Our successful assembly of full-length 
contigs from ~1% genome coverage (using a stringent
assembly algorithm) indicates that all seven elements are
present in very high copy number, and that little sequence
divergence (<5-8% based on assembly parameters) exists 
among individual copies. This suggests that all seven TE 
superfamilies have been recently active and/or continue to 
be active in our focal salamander species. We tested whether 
ongoing transcription of these same superfamilies was also 
occurring in Ambystoma mexicanum using TBLASTX against
the A. mexicanum transcriptome
(http://www.ambystoma.org/genome-resources/21-blast
[date last accessed 26 Sep 2011]) and confirmed transcripts
of all seven superfamilies.
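
The inference from assembly success to high copy number rests on simple arithmetic: at a genome coverage c, a consensus built from N near-identical copies is expected to reach a depth of about c × N. The sketch below makes this explicit. The target depth of 5 is an arbitrary illustrative value, not a parameter of the assembler used here, and the function names are hypothetical.

```python
def consensus_depth(genome_coverage, copy_number):
    """Expected read depth over a consensus built from `copy_number`
    near-identical copies, at the given genome coverage (1% = 0.01)."""
    return genome_coverage * copy_number


def copies_needed(genome_coverage, target_depth):
    """Copy number at which the expected consensus depth reaches target_depth."""
    return target_depth / genome_coverage


# At ~1% coverage, a single-copy sequence is sampled at depth 0.01,
# so contigs can only form for families present in many copies.
print(consensus_depth(0.01, 1))
print(copies_needed(0.01, 5))  # illustrative target depth (assumption)
```

Under these assumptions, an element must be present in hundreds of sufficiently similar copies before a full-length contig becomes likely at ~1% coverage.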

Summary of TE Landscapes across Salamander Species 

The proportion of 454 shotgun data classified as TEs in each 
species is summarized in table 2. Because we are working 
with low-coverage shotgun reads of largely unexplored ge- 
nomes, all of these numbers are underestimates of total TE 
content; they do not necessarily reflect proportions of the 
genome made up of low-copy-number TEs, TEs with no re- 
cent proliferation activity, or TE boundary sequences (see 
Discussion). Regardless, our results clearly demonstrate that 
TEs have played a substantial role in generating salamanders'
enormous genomes. For example, 47.52% of the shotgun
Fig. 2. The structures of seven full-length TE sequences mined from salamander shotgun reads. Abbreviations: gag, capsid-like protein; pro,
protease; RT, reverse transcriptase; rve, integrase; ENV, envelope protein; YR, tyrosine recombinase; EN, endonuclease. 



reads of Aneides flavipunctatus represent recognizable TEs. 
Note that an additional 15.01% of this genome is
unclassifiable, but falls within the category of nonsimple
repetitive sequence, suggesting that these sequences are
interspersed repeats likely
derived from transposition activity. These results are consis- 
tent with earlier DNA-DNA hybridization analyses, which 
showed high levels of repetitive sequence in salamander ge- 
nomes, as well as with limited recombinant DNA-based stud- 
ies identifying select TEs active in salamanders (Baldari and 
Amaldi 1976; Batistoni et al. 1995; Marracci et al. 1996). 

Our results show that salamander genomes harbor al- 
most all of the major TE types reported in previously char- 
acterized eukaryotic genomes. We identified 29 different TE 
superfamilies in total across the 6 species, 22 of which were 
present in two or more species (supplementary file 3, 
Supplementary Material online). The percentage of shotgun 
data mapping to each superfamily is depicted in figure 3 
(Aneides flavipunctatus) and summarized numerically in 
supplementary file 3, Supplementary Material online (all 
species) and depicted in supplementary file 4, Supplemen- 
tary Material online. Across all six species, the most abun- 
dant elements are Ty3/gypsy retrotransposons, comprising 
7-20% of the data set for each species. Ty3/gypsy elements 
were previously shown to exist at high copy numbers in 
the plethodontid genus Hydromantes based on cloning/hy- 
bridization analyses, although such methods failed to recover 
them from the genus Desmognathus (Marracci et al. 1996). 
Three other elements are also consistently among the most
abundant across species: LINE/L2 non-LTR retrotransposons
(1.7-8.8% of the genome), DIRS retrotransposons (2.0-5.7%
of the genome), and LTR/ERV1 endogenous retroviruses
(0.5-11.3% of the genome).



Comparison of the Salamander TE Landscape with 
Other Vertebrate Genomes 

Although the same TEs are present in salamanders as in 
most other vertebrates, our results indicate that the propor- 
tion of LTR retrotransposons is much higher in all six species 
of salamanders than it is in any of the other vertebrate 
genomes we examined (fig. 4). This pattern holds, despite 
substantial differences in both genome size and percentage 
of genomic coverage across our six focal salamander 
species. We emphasize that this difference is an underesti- 
mate of the true difference in LTR levels, as our analyses of 
low-coverage shotgun data underestimate the total TE con- 
tent of salamanders (see Discussion). Thus, LTR retrotranspo- 
sons underlie genomic gigantism in extant plethodontid 
salamanders. This result, in turn, suggests expansion of 
LTR retrotransposons as a likely molecular mechanism 
underlying genomic expansion at the base of the salaman- 
der clade. Further analyses that include basal salamander lin- 
eages, as well as analytical tools designed to identify highly 
divergent TE copies (Gu et al. 2008; Singh et al. 2010), will
allow an even more rigorous test of this hypothesis. 

Notably, genome content in salamanders differs most 
dramatically from Xenopus, the only other amphibian for 
which comparable data exist. The Xenopus TE landscape 
Fig. 3. The TE landscape of the Aneides flavipunctatus genome. Element superfamilies are ranked from most to least abundant along the x



is largely composed of DNA transposons (fig. 4). Such exten- 
sive divergence in TE content, coupled with the extreme ge- 
nomic expansion seen in salamanders, points to amphibians 
as an interesting clade to target for more detailed analysis of 
genome evolution. 

Comparison of TE Landscapes among Salamanders 

Our principal component analyses (PCAs) summarize the main
differences in TE landscape among four of our six focal taxa.
All three PC axes are composed of TEs from all three classes
(LTR retrotransposons, non-LTR retrotransposons, and DNA
transposons),
indicating that differences in genome content among taxa 
are not limited to differences in a specific type of TE (fig. 5). 
More generally, these results allow us to test whether ge- 
nome content is similar among taxa with more recent shared 
ancestry, similar genome sizes, or neither. Species show no 
clustering based on phylogenetic relationships, indicating 
that species are sufficiently diverged from one another 



Fig. 4. The TE landscape of salamanders compared with that of other vertebrates. Salamanders have higher relative levels of LTR
retrotransposons. For Danio rerio, we did not include the 11% of the genome identified as repetitive, but classified only as "DNA."
(>25 Myr) that their TE landscapes retain no pattern of
shared ancestry. Finally, no PC scores for any axis were re- 
lated to genome size, indicating that groups of different TEs 
that vary in a correlated fashion do not explain genome size 
variation among these four species. 

Overall, we note that the total TE content estimated for 
our six focal species (table 2) does not match predictions 
based on genome size; for example, Desmognathus ochro- 
phaeus has the smallest genome but does not show the 
smallest proportion of TEs in our analyses. Although this pat- 
tern may reflect true differences in TE content, suggesting 
that genome size variation within salamanders reflects dif- 
ferences in non-TE DNA content, we conservatively attribute 
this discrepancy to limitations in our ability to detect TEs 
from low-coverage 454 shotgun data. For example, if the 
D. ochrophaeus genome contains a greater number of 
low-frequency TEs, or TEs with no recent proliferation activ- 
ity, we would fail to detect them in our analysis, leading to 
a greater underestimation of total TE content in this species. 
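
One simple way to ask whether estimated TE content tracks genome size is a rank correlation across species. The sketch below implements Spearman's coefficient in plain Python. The input vectors are toy values chosen only to exercise the code; they are not measurements from this study, and with only a handful of species any such coefficient has very little statistical power.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values (hypothetical, not data from this study).
toy_genome_size = [10, 20, 30, 40]
toy_te_percent = [30, 10, 40, 20]
print(spearman(toy_genome_size, toy_te_percent))
```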

TE Proliferation History in Salamanders 

Sequence divergence distributions representing proliferation
history of the most abundant superfamilies in each TE class
are shown in figure 6 (Ty3/gypsy elements), supplementary
file 5, Supplementary Material online (LINE/L2 elements),
and supplementary file 6, Supplementary Material online
(DNA/Harbinger). Under the assumption of a constant
substitution rate, sequence divergence distributions are 
equivalent to age distributions. All distributions suggest
ongoing TE proliferation, indicated by element copies with
sequence divergence <1% from the consensus (Novick et al.
2010). Transcripts of all such superfamilies were detected 
in the Ambystoma mexicanum transcriptome database 
(http://www.ambystoma.org/genome-resources/21-blast 
[date last accessed 26 Sep 2011]), suggesting that they are
also transcriptionally active in this ambystomatid salaman- 
der species. In addition, all distributions include copies 
with high (<40%) sequence divergence, suggesting that 
elements reach fixation in salamander populations and 
are subsequently preserved in the genome for long periods 
of time; this pattern is consistent with a low negative impact 
of TE insertions on the host (Novick et al. 2009; Novick et al. 
2010). However, we emphasize that this pattern may also 
reflect our inability to assemble consensus sequences repre- 
senting all families/subfamilies within each superfamily from 
low-coverage shotgun data. Thus, additional data collection 
will be required to rigorously test this hypothesis. 
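
The equivalence between divergence and age under a constant substitution rate can be made concrete as follows. The sketch assumes that the consensus approximates the ancestral (master) sequence, so that age is distance divided by rate, and it applies the standard Jukes-Cantor correction for multiple substitutions; whether such a correction was applied in the analyses above is not stated, so it is an assumption here. The rate used in the example is a placeholder, not an estimate for salamanders.

```python
import math


def jukes_cantor_distance(p):
    """Substitutions per site implied by an observed proportion p of
    differing sites, under the Jukes-Cantor model."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_age_years(p, subs_per_site_per_year):
    """Age of a copy from its divergence to the consensus, assuming the
    consensus approximates the ancestral sequence and the rate is constant."""
    return jukes_cantor_distance(p) / subs_per_site_per_year


PLACEHOLDER_RATE = 1e-9  # hypothetical rate, for illustration only

for divergence in (0.01, 0.10, 0.40):
    age = insertion_age_years(divergence, PLACEHOLDER_RATE)
    print(f"{divergence:.0%} divergence -> {age:.3g} years")
```

The correction is negligible for copies near 1% divergence but grows quickly for the most divergent copies, which is one reason the old tail of such distributions should be interpreted cautiously.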
Fig. 6. Age distribution of Ty3/gypsy elements in four species of salamanders.



TE Proliferation Dynamics, TE Content, and Genome 
Size Comparisons among Salamander Species 

In total, we generated sequence divergence distributions for 
the most abundant superfamily in each TE class from the 
four species for which we have 0.7-1.9% coverage. The
genome sizes of these four species range from 15 to 44 Gb.
We examined these distributions to determine whether this 
variation in genome size reflects 1) the frequency of bursts 
of TE proliferation and/or 2) how recently such bursts oc- 
curred. Our results show no such correlations; larger sala- 
mander genomes show no consistent pattern of having 
more frequent, or more recent, proliferation bursts. This re- 
sult, coupled with the results of our PCA showing that no PC 
scores for any axis are related to genome size, suggests that 
evolutionary changes in genome size among these four taxa 
have not been dictated solely by the tempo and mode of 
proliferation of any of the most abundant elements. How- 
ever, we emphasize that our sampling was designed to iden- 
tify differences between salamanders and other vertebrates; 
increased phylogenetic breadth and sequencing depth are
required to test whether TE dynamics correlate with evolu- 
tionary changes in genome size within the salamander 
clade. 

Discussion 

Our results represent the first in-depth comparative analyses 
of the repetitive landscape of salamander genomes, the 
largest among the tetrapods and, with the exception of 
lungfish, among vertebrates as a whole (Gregory 2011). 



We demonstrate that 1) salamander genomes have fairly 
high TE content, including representatives of all of the major 
types of TEs found in well-annotated eukaryotic genomes, 
2) many TEs show evidence of recent and/or ongoing
proliferation, and 3) Ty3/gypsy elements are the most abundant
TE superfamily. Furthermore, we show that salamanders are 
unique among vertebrates in their overall genome compo- 
sition; although LTR retrotransposon abundance varies 
among salamanders, LTR retrotransposon levels are higher 
in all sampled salamanders than in other vertebrates 
(fig. 4). This pattern holds, despite 3-fold differences in ge- 
nome size among our focal salamander species, as well as 
limitations in our ability to identify TE-derived sequences 
from low-coverage shotgun data (see below). Thus, LTR ret- 
rotransposons underlie genomic gigantism in extant pletho- 
dontid salamanders and increased LTR proliferation is 
a candidate molecular mechanism underlying genomic ex- 
pansion at the base of the salamander clade. 

Among our six focal species, however, no clear relationship
exists between genome size and TE content or proliferation
dynamics. Both biological and analytical explanations for
this lack of correlation are possible. First, genome
size evolution within plethodontids may be shaped
by factors other than TE proliferation dynamics. For exam- 
ple, selection for smaller genome size has been proposed in 
lineages experiencing metamorphosis, where slow rates of 
cell division and differentiation associated with large ge- 
nomes would extend a vulnerable stage of ontogeny (Wake 
and Marks 1993). Such indirect selection against TE expan- 
sion could impact relative TE abundance (the variable we 
measured) in many different ways. Second, our analytical
method may have obscured a true correlation between 
TE content and genome size. Our analysis of low-coverage 
shotgun data underestimates true TE content in predictable 
ways: 1) We miss low-copy-number repeats; RepeatModeler 
requires a minimum number of four sequence copies per data 
set to identify a sequence as repetitive
(http://www.repeatmasker.org/RepeatModeler.html [date last
accessed 26 Sep 2011]); 2) We miss noncoding sequence of
superfamilies with higher levels of sequence divergence. Our analysis
requires >92% sequence identity during contig assembly 
(Huang et al. 2003). Thus, we will not obtain full-length or 
near-full-length contigs of older divergent elements and such 
element abundances will be underestimated. Therefore, if ge- 
nomes of our focal taxa differ in the proportion of low fre- 
quency or highly divergent TEs, we will differentially 
underestimate TE content across species. Finally, comparison 
of the LTR sequences from our putative full-length LTR retro- 
transposons with those we mined from Ambystoma BAC 
clones shows that the LTRs of Ty3/gypsy are much shorter
in our contigs (supplementary file 2, Supplementary Material 
online); thus, even under the "best" conditions, when ele- 
ments exist in high copy number with low sequence diver- 
gence, we will still underestimate their relative abundance. 
This underestimate is likely to be uniform across all six species, 
but nonetheless contributes to the imprecision in our esti- 
mates of TE content. More generally, our analyses do not take 
into account TE deletion. Removal of TE sequences via both 
small deletions mediated by replication slippage and larger 
deletions mediated by ectopic recombination between TE 
copies is a critical component of TE dynamics that clearly im- 
pacts genome size evolution (Petrov 2002; Bennetzen et al. 
2005). Using low-coverage shotgun data, the tempo and 
mode of DNA/TE loss is much more difficult to estimate than 
that of DNA gain through TE proliferation; however, future 
research aimed at understanding DNA loss is required. Finally, 
we note that other studies have shown a disconnect between 
TE dynamics and evolutionary changes in genome size (e.g., 
Wicker et al. 2009), supporting the view that integration of 
molecular, organismal, and population-level analyses is critical 
for generating a comprehensive picture of genome size evo- 
lution (Gregory 2003; Cavalier-Smith 2005). 
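
The first source of underestimation listed above, the failure to detect low-copy-number repeats, can be illustrated with a simple sampling model. The sketch assumes that reads sample the genome uniformly, so that the number of reads overlapping a given position of a family present in N copies is Poisson distributed with mean c × N at genome coverage c. It treats the four-copy minimum cited above as "at least four overlapping reads", which is a simplification rather than a description of RepeatModeler's algorithm; the function names are hypothetical.

```python
import math


def prob_at_least(k, mean):
    """P(X >= k) for a Poisson variable X with the given mean."""
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def detection_probability(genome_coverage, copy_number, min_copies=4):
    """Chance that a position of a repeat family is sampled at least
    `min_copies` times, under uniform random sampling of reads."""
    return prob_at_least(min_copies, genome_coverage * copy_number)


# At ~1% coverage (0.01), families with few copies are rarely sampled often enough.
for copies in (10, 100, 1000):
    print(copies, round(detection_probability(0.01, copies), 4))
```

Under this model, detection rises steeply with copy number, so differences among species in the share of low-copy-number families would translate into unequal underestimation of TE content.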

Our results complement recent work describing the genic
component of the genome of Ambystoma mexicanum 
(Smith et al. 2009), a representative of the salamander fam- 
ily Ambystomatidae and a major model system for labora- 
tory studies in a number of biomedical and basic research 
disciplines (Smith et al. 2005). Ambystomatid salamanders 
diverged from plethodontid salamanders, the focal clade of 
this study, ~85-200 Ma (Marjanovic and Laurin 2007;
Zhang and Wake 2009; Zheng et al. 2011). BAC sequencing
in Ambystoma demonstrated that salamander introns are 
substantially longer than human, chicken, and frog introns. 
Thus, increased intron length also contributes to genomic
expansion in salamanders (Smith et al. 2009), although
longer introns may reflect TE accumulation. Combining
analyses that target the genic component of the
Ambystoma genome (Salamander Genome Project:
http://www.ambystoma.org/research/salamander-genome-project
[date last accessed 26 Sep 2011]), as well as the nongenic
component from diverse salamander species (current study), 
will ultimately yield a comprehensive picture of the molecular 
processes underlying genomic gigantism in salamanders, as 
it has in other taxa (Bennetzen et al. 2005). 

Recent work has stressed the importance of considering 
the role of population genetic parameters in shaping genome 
size evolution; specifically, in organisms with smaller effective 
population sizes, natural selection is less effective at purging 
slightly deleterious "extra" DNA, which may lead to genome 
size increases (Lynch 2007, 2011; but see Whitney et al.
2011). Under this hypothesis, salamanders are predicted to 
have much smaller effective population sizes than other ver- 
tebrates. However, there is no evidence that this is the case 
(Frankham 1995). Furthermore, body size, a rough inverse
proxy for effective population size, does not support this
hypothesis (Organ and Shedlock 2009); salamanders are small
relative to many other vertebrate taxa and would therefore be
expected to have comparatively large populations. Thus,
although stronger genetic drift in smaller populations may
underlie broad patterns
of genome size evolution across the tree of life, it does not 
appear to explain genomic gigantism in salamanders. 

Across eukaryotes, only a limited number of larger ge- 
nomes have been analyzed in detail because of obvious 
technological and analytical challenges (Ambrozova et al. 
2011). The majority of such studies have been performed
in angiosperms, reflecting both their great agricultural im- 
portance and their enormous diversity of genome sizes 
(Bennett and Leitch 2010); however, even such angiosperm 
studies have emphasized genomes toward the smaller end 
of the size range. LTR retrotransposons appear to form the 
majority of most angiosperm genomes (Vitte and Bennetzen 
2006; Huo et al. 2008), and their increased abundance is 
correlated with genome expansion in diverse plant taxa 
(Vitte and Panaud 2005), including Gossypium (cotton) 
(Hawkins et al. 2006), Oryza (rice) (Zuccolo et al. 2007; Gill 
et al. 2010), Eleocharis (family Cyperaceae) (Zedek et al. 
2010), Vicia (family Fabaceae) (Neumann et al. 2006), maize
(Sanmiguel et al. 1998), and Helianthus (sunflower) (Staton 
et al. 2009). Finally, Ty3/gypsy LTR retrotransposons are the 
most abundant elements found in the extremely large ge- 
nomes of Fritillaria species (Liliaceae), although the vast 
majority of those ~44 Gb-sized genomes remains
uncharacterized (Ambrozova et al. 2011). Fungi, in contrast,
have small nuclear genomes with comparatively limited size
variation across taxa; only a few outliers reach even 400-700
Mbp (Kullman et al. 2005). Such "outliers" that have been 
partially characterized (e.g., Gigaspora margarita) contain 
both LTR and non-LTR retrotransposons (Gollotte et al. 
2006). Limited examples of genomic expansion exist from 
the other main eukaryotic clades; for example, the genome
of Phytophthora infestans, the chromalveolate pathogen re- 
sponsible for the Irish potato famine in the 1800s, shows 
genomic expansion (genome size 240 Mb) caused by 
proliferation of Ty3/gypsy retrotransposons (Haas et al. 
2009). Within animals, limited cases of genomic gigantism 
are found not only in the deuterostomes (e.g., salamanders, 
lungfishes; see fig. 1) but also within several protostome 
clades; certain lineages of grasshoppers (e.g., genus Podis- 
ma), flatworms (e.g., genus Otomesostoma), and amphipods 
(e.g., genus Ampelisca) have genome sizes estimated at 64, 
21, and 64 Gb, respectively (Gregory 2011), but the molecular
mechanisms underlying such genomic expansion remain
largely unknown (Parchem et al. 2010) (but see Bensasson 
et al. 2001 for evidence of slower DNA loss in Podisma). 
Our results in salamanders, coupled with results from several 
angiosperm taxa, indicate that extreme increases in genome 
size may be more likely to reflect expansion of LTR retrotrans- 
posons than other TEs, which could suggest a different bal- 
ance between TE proliferation and silencing among the main 
TE classes. Alternatively, it could suggest that LTR retrotrans- 
posons may more effectively mitigate their deleterious effects 
on the host genome through the targeting of "safe havens" 
for insertion (Gao et al. 2008). Analysis of diverse eukaryotic 
taxa with large genomes is required to rigorously test this hy- 
pothesis. More generally, extending genomic analyses to phy- 
logenetically diverse lineages with large genomes will be 
critical for generating a more complete picture of eukaryotic 
genome evolution (Ambrozova et al. 2011; Voss et al. 2011).
Our work, as well as other recent studies using low-coverage 
data to characterize repeat element landscapes, suggests that 
such analyses are now feasible, despite the fact that assem- 
bling large repetitive genomes remains intractable (Macas 
et al. 2007; Castoe et al. 2011).

Although the TE landscape of salamanders is the focus of 
our work (as it provides a potential mechanism for genomic 
expansion), many researchers target the single- or low-copy 
sequences within a genome for analyses ranging from pro- 
tein function to phylogenetic history. Such studies are ham- 
pered by unknown repetitive landscapes; without 
a database of known TEs, homology-based repeat-masking 
analyses are ineffective. Our work will benefit researchers 
targeting the single- or low-copy sequences within salaman- 
ders by providing such a database of TEs. More generally, 
the pipeline we developed can be used by any researcher 
to generate a similar database in an unexplored genome, 
provided the TEs exist in sufficiently high copy number with 
sufficient sequence identity. Thus, our work also contributes
to other fields (e.g., phylogenetic systematics and popula- 
tion genetics) transitioning to large-scale genomic data sets 
(Thomson et al. 2010). 

For decades, evolutionary biologists have inferred that 
salamanders' huge genomes relative to other vertebrates 
are related to the clade's extremely low metabolic rates, just
as the compact genomes of birds and flying mammals are
linked to high metabolic rates (Olmo and Morescalchi 1975;
Szarski 1983; Burton et al. 1989; Cavalier-Smith 1991; 
Gatten et al. 1992; Waltari and Edwards 2002). Mechanis- 
tically, this inverse relationship between genome size and 
metabolic rate has been explained in several subtly different 
ways that build on the positive correlation between genome 
size and cell size and, more specifically, the low cell surface- 
to-volume ratios associated with large cells (Olmo and 
Morescalchi 1975; Szarski 1983; Lay and Baldwin 1999; 
Kozlowski et al. 2003; Kozlowski et al. 2010). Within 
salamanders, however, no strong correlation exists between 
metabolic rate and genome size, suggesting that other fac- 
tors drive among-lineage genome size variation within the 
clade (Gregory 2003). A mechanistic link between large ge- 
nomes and low metabolic rate remains the topic of debate, 
as does the adaptive significance of genomic expansion in 
salamanders (Cavalier-Smith 1991; Roth et al. 1994).
However, we emphasize that a full understanding of the forces
shaping genome expansion in this clade requires integrating 
detailed analyses of molecular mechanisms into tests of 
these long-standing physiological hypotheses. Our results 
represent a first step toward such a comprehensive picture 
of salamander genomics that considers evolutionary forces 
acting at the genome, cell, organism, and population levels. 
Future studies aimed at the balance between host-mediated 
TE silencing and TE proliferation in salamanders, particularly 
for LTR retrotransposons, will add to this picture, as will anal- 
yses integrating genomic and organismal data in an explicit 
phylogenetic context. 

Supplementary files 1-6 are available at Genome Biology
and Evolution online (http://www.gbe.oxfordjournals.org/).